Introduction
In this case study, we analyzed two data sets provided by our client Budweiser, the Beers data set contains a list of 2410 US craft beers and Breweries data set contains 558 US breweries. The purpose of the code is to find statistical characters of IBU and ABV data, also the correlation between them to address questions from Budweiser.
Repository Structure
Case study Repository in GitHub: https://github.com/ysalame98/Project1 Files: Reers.csv: Source beer data. Breweries.csv: Source breweries data. AlcoholAbuse.csv: US Dept OF HHS https://www.samhsa.gov/data/report/2016-2017-nsduh-state-specific-tables EDAforProject1_R5.RMD: R Markdown file with all R code. EDAforProject1_R5.HTML: HTML file created with Knit.
Analysis:
Data Cleansing and Tidy:
- Identify missing and develop a strategy to resolve
- Identify any duplicates entries
- Since Breweries file has city names look for slight name variations and resolve
Initial Data Profiling
Q1: How many breweries are present in each state?
The number of breweries in 51 states has been calculated and provided in the below table along with a column plot and Tree Plot
##
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO
## 7 3 2 11 39 47 8 1 2 15 7 4 5 5 18 22 3 4 5 23 7 9 32 12 9
## MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV
## 2 9 19 1 5 3 3 4 2 16 15 6 29 25 5 4 1 3 28 4 16 10 23 20 1
## WY
## 4
2. Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file.
We merged the two data set into one and named it as ‘mergeDF’, the primary key being used is ‘Brewery_id’ from Beer data set, and ‘Brew_ID’ from Breweries data set. We also changed the two columns’ name for clear understanding. The first and last 6 observations were showed there with head/tail command.
mergeDF <- merge(BeersDF, BreweriesDF, by.x = "Brewery_id", by.y = "Brew_ID")
colnames(mergeDF)[2] = "Beer_name"
colnames(mergeDF)[8] = "Brewery_name"
print("The first 6 observations of the merged file:")## [1] "The first 6 observations of the merged file:"
## Brewery_id Beer_name Beer_ID ABV IBU
## 1 1 Get Together 2692 0.045 50
## 2 1 Maggie's Leap 2691 0.049 26
## 3 1 Wall's End 2690 0.048 19
## 4 1 Pumpion 2689 0.060 38
## 5 1 Stronghold 2688 0.060 25
## 6 1 Parapet ESB 2687 0.056 47
## Style Ounces Brewery_name
## 1 American IPA 16 NorthGate Brewing
## 2 Milk / Sweet Stout 16 NorthGate Brewing
## 3 English Brown Ale 16 NorthGate Brewing
## 4 Pumpkin Ale 16 NorthGate Brewing
## 5 American Porter 16 NorthGate Brewing
## 6 Extra Special / Strong Bitter (ESB) 16 NorthGate Brewing
## City State
## 1 Minneapolis MN
## 2 Minneapolis MN
## 3 Minneapolis MN
## 4 Minneapolis MN
## 5 Minneapolis MN
## 6 Minneapolis MN
## [1] "The last 6 observations of the merged file:"
## Brewery_id Beer_name Beer_ID ABV IBU
## 2405 556 Pilsner Ukiah 98 0.055 NA
## 2406 557 Heinnieweisse Weissebier 52 0.049 NA
## 2407 557 Snapperhead IPA 51 0.068 NA
## 2408 557 Moo Thunder Stout 50 0.049 NA
## 2409 557 Porkslap Pale Ale 49 0.043 NA
## 2410 558 Urban Wilderness Pale Ale 30 0.049 NA
## Style Ounces Brewery_name
## 2405 German Pilsener 12 Ukiah Brewing Company
## 2406 Hefeweizen 12 Butternuts Beer and Ale
## 2407 American IPA 12 Butternuts Beer and Ale
## 2408 Milk / Sweet Stout 12 Butternuts Beer and Ale
## 2409 American Pale Ale (APA) 12 Butternuts Beer and Ale
## 2410 English Pale Ale 12 Sleeping Lady Brewing Company
## City State
## 2405 Ukiah CA
## 2406 Garrattsville NY
## 2407 Garrattsville NY
## 2408 Garrattsville NY
## 2409 Garrattsville NY
## 2410 Anchorage AK
3. Address the missing values in each column.
The missing values can only be found in column IBU (1005 NA’s) and column ABV (62 NA’s), refer to red bars in the first plot. As we don’t want to delete all the missing values, we started with Option 1 to deal with the missing values, in which the state level median values were calculated for both IBU and ABV and used to replace missing values respectively, but we found that the correlation between the IBU and ABV would be heavily impacted due to this replacement.
Option 1: Simply replacing missing values in IBU/ABV with state median results in significant drop on correlation, from 67% to 49% (Refer to #7)
IBUNADF <- mergeDF %>% filter(is.na(IBU))
ABVNADF <- mergeDF %>% filter(is.na(ABV))
BOTHNADF <- mergeDF %>% filter(is.na(ABV) & is.na(IBU))
# Check which state give the most NA's
summary(IBUNADF$State)## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 8 1 4 23 48 119 21 4 1 21 9 9 5 13 52 48 4 7
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 9 31 11 20 124 9 13 0 17 29 0 16 6 0 8 3 28 17
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 8 38 53 7 9 7 1 41 15 5 10 25 45 0 3
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 25 10 5 47 183 265 27 8 2 58 16 27 30 30 91 139 23 21
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 19 82 21 27 162 55 42 11 40 59 3 25 8 8 14 11 74 49
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 19 125 100 27 14 7 6 130 26 40 27 68 87 2 15
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO
## 0 0 0 3 1 15 0 0 1 2 0 0 0 0 0 2 0 1 0 0 0 0 11 0 3
## MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV
## 0 1 4 0 4 0 0 1 1 1 0 0 0 3 0 0 0 0 6 0 0 0 0 2 0
## WY
## 0
# NADF <- data.frame(State=State, ALL=summary(mergeDF$State), IBUNA=summary(IBUNADF$State), ABVNA=summary(ABVNADF$State))
# NADF$IBUNAperc <- NADF$IBUNA/NADF$ALL
# NADF$ABVNAperc <- NADF$ABVNA/NADF$ALL
# Calculate IBU and ABV median value of each state
IBUmedianState <- mergeDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmedian=median(IBU))
ABVmedianState <- mergeDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmedian=median(ABV))
# Replace IBU NA's with state median value
mergeDF_no_IBU_NA <- left_join(mergeDF, IBUmedianState, by = "State")
mergeDF_no_IBU_NA$IBU[is.na(mergeDF_no_IBU_NA$IBU)] <- as.character(mergeDF_no_IBU_NA$IBUmedian[is.na(mergeDF_no_IBU_NA$IBU)])
aggr(mergeDF_no_IBU_NA, prop=FALSE, numbers=TRUE) # Still 7 observations IBU NA, because there is no any IBU value available for state # Replace 7 IBU NA's in SD with grand median value 35.00
mergeDF_no_IBU_NA$IBU = str_replace_na(mergeDF_no_IBU_NA$IBU, replacement = "35")
mergeDF_no_IBU_NA$IBUmedian = str_replace_na(mergeDF_no_IBU_NA$IBUmedian, replacement = "35")
# Replace ABV NA's with state median value
mergeDF_noNA <- left_join(mergeDF_no_IBU_NA, ABVmedianState, by = "State")
mergeDF_noNA$ABV[is.na(mergeDF_noNA$ABV)] <- as.character(mergeDF_noNA$ABVmedian[is.na(mergeDF_noNA$ABV)])
aggr(mergeDF_noNA, prop=FALSE, numbers=TRUE)mergeDF_noNA$IBU <- as.numeric(mergeDF_noNA$IBU)
mergeDF_noNA$ABV <- as.numeric(mergeDF_noNA$ABV)
# Plot
IBU.plotly <- mergeDF_noNA %>% ggplot(aes(x=IBU))+geom_bar()+ggtitle("IBU Distribution") +
theme(axis.text.x = element_text(size = 6, angle = 90))
ggplotly(IBU.plotly)ABV.plotly <- mergeDF_noNA %>% ggplot(aes(x=ABV))+geom_bar()+ggtitle("ABV Distribution") +
theme(axis.text.x = element_text(size = 6, angle = 90))
ggplotly(ABV.plotly)## Brewery_id Beer_name Beer_ID
## Min. : 1.0 Nonstop Hef Hop : 12 Min. : 1.0
## 1st Qu.: 94.0 Dale's Pale Ale : 6 1st Qu.: 808.2
## Median :206.0 Oktoberfest : 6 Median :1453.5
## Mean :232.7 Longboard Island Lager: 4 Mean :1431.1
## 3rd Qu.:367.0 1327 Pod's ESB : 3 3rd Qu.:2075.8
## Max. :558.0 Boston Lager : 3 Max. :2692.0
## (Other) :2376
## ABV IBU Style
## Min. :0.00100 Min. : 4.00 American IPA : 424
## 1st Qu.:0.05000 1st Qu.: 26.00 American Pale Ale (APA) : 245
## Median :0.05700 Median : 35.00 American Amber / Red Ale : 133
## Mean :0.05973 Mean : 39.81 American Blonde Ale : 108
## 3rd Qu.:0.06700 3rd Qu.: 47.00 American Double / Imperial IPA: 105
## Max. :0.12800 Max. :138.00 American Pale Wheat Ale : 97
## (Other) :1298
## Ounces Brewery_name City
## Min. : 8.40 Brewery Vivant : 62 Length:2410
## 1st Qu.:12.00 Oskar Blues Brewery : 46 Class :character
## Median :12.00 Sun King Brewing Company : 38 Mode :character
## Mean :13.59 Cigar City Brewing Company: 25
## 3rd Qu.:16.00 Sixpoint Craft Ales : 24
## Max. :32.00 Hopworks Urban Brewery : 23
## (Other) :2192
## State IBUmedian ABVmedian
## CO : 265 Length:2410 Min. :0.04000
## CA : 183 Class :character 1st Qu.:0.05500
## MI : 162 Mode :character Median :0.05700
## IN : 139 Mean :0.05677
## TX : 130 3rd Qu.:0.05800
## OR : 125 Max. :0.06250
## (Other):1406
Option 2: State level median values were only used to replace ABV missing values (62 NA’s in ABV), then we predicted the IBU values (1005 NA’s) with a Simple Linear Regression model base on the exploratory variable ABV, this way we keep the correlation between IBU/ABV no big change. This is the option that we used to get the data frame ‘tidyDF’, which is used to address the following questions.
# Calculate ABV median value of each state
ABVmedianState <- mergeDF %>% filter(!is.na(ABV)) %>% group_by(State) %>%
dplyr::summarise(ABVmedian=median(ABV))
# Replace 62 NA's in ABV with state level median
mergeDF_no_ABV_NA <- left_join(mergeDF, ABVmedianState, by = "State")
mergeDF_no_ABV_NA$ABV[is.na(mergeDF_no_ABV_NA$ABV)] <-
mergeDF_no_ABV_NA$ABVmedian[is.na(mergeDF_no_ABV_NA$ABV)]
aggr(mergeDF_no_ABV_NA, prop=FALSE, numbers=TRUE)# Build simple regression model with 1405 pairs of IBU/ABV to get the correlation coef.
mergeDF_tidy <- mergeDF %>% filter(!is.na(IBU) & !is.na(ABV))
mod <- lm(mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
summary(mod)##
## Call:
## lm(formula = mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.849 -11.977 -0.721 13.997 93.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.099 2.326 -14.66 <2e-16 ***
## mergeDF_tidy$ABV 1282.037 37.860 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4493
## F-statistic: 1147 on 1 and 1403 DF, p-value: < 2.2e-16
Intercept <- mod$coefficients[1]
Slope <- mod$coefficients[2]
# Add new column predicted_IBU base on correlation coef.
# Change negative predicted_IBU to 0, IBU is measured on a scale from 0 to infinite
mergeDF_no_ABV_NA$predict_IBU <- pmax(Intercept + Slope*mergeDF_no_ABV_NA$ABV, 0)
summary(mergeDF_no_ABV_NA$IBU)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 21.00 35.00 42.71 64.00 138.00 1005
# Replace 1005 NA's in IBU with predicted_IBU
mergeDF_no_ABV_NA$IBU[is.na(mergeDF_no_ABV_NA$IBU)] <-
mergeDF_no_ABV_NA$predict_IBU[is.na(mergeDF_no_ABV_NA$IBU)]
summary(mergeDF_no_ABV_NA$IBU)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 36.41 42.49 55.64 138.00
4. Compute the median alcohol content(ABV) and international bitterness unit (IBU) for each state. Plot a bar chart to compare.
Median ABV and IBU data have been sorted and plotted using bar char with flipped coordinates. Due to large # on bars, this makes the charts hard to read. Line chart with points will be a choice.
# plot IBU median distribution using IBUmedianState data frame
## Calculate IBU and ABV median value of each state
IBUmedianState <- tidyDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmedian=median(IBU))
ABVmedianState <- tidyDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmedian=median(ABV))
## Plot
IBUmedianState$State <- factor(IBUmedianState$State, level = IBUmedianState$State[order(IBUmedianState$IBUmedian)])
IBUmedianState %>%
ggplot(aes(x=State, y=IBUmedian, fill=State)) +
geom_bar(stat = "identity") +
ggtitle("IBU Median By States")+
# geom_text(aes(label=IBUmedian), hjust=-0.5) +
# theme(axis.text.y = element_text(size = 4)) +
# coord_flip()
# ggplotly()
geom_text(mapping = aes(label = sprintf("%0.2f", round(IBUmedian,digits = 2)), size=3), check_overlap = TRUE) +
theme(axis.text = element_text(size =7) ,legend.position = "none") +
coord_flip()# plot ABV median distribution using ABVmedianState data frame
ABVmedianState$State <- factor(ABVmedianState$State, level = ABVmedianState$State[order(ABVmedianState$ABVmedian)])
ABVmedianState %>% ggplot(aes(x=State, y=ABVmedian, fill=State)) + geom_bar(stat = "identity") + ggtitle("ABV Median By States")+
geom_text(aes(label=ABVmedian), hjust=-0.5) +
theme(axis.text.y = element_text(size = 6)) +
coord_flip()# geom_text(mapping = aes(label = sprintf("%0.3f", round(ABVmedian,digits = 4)), hjust=-0.5,size=3), check_overlap = TRUE) +
# theme(axis.text = element_text(size =7) ,legend.position = "none") +
# coord_flip()
# ggplotly()
# plot scatter with 51 pairs of median values of ABV and IBU
ABV_IBU_mergeDF <- merge(ABVmedianState, IBUmedianState, by = "State" )
ABV_IBU_mergeDF$State <- factor(ABV_IBU_mergeDF$State, droplevels(ABV_IBU_mergeDF$State))
ABV_IBU_mergeDF %>% ggplot(aes(x=IBUmedian, y=ABVmedian)) + geom_point()5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?
Maximum ABV and IBU data have been sorted and plotted.
IBUmaxState <- tidyDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmax=max(IBU))
ABVmaxState <- tidyDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmax=max(ABV))
# plot IBU max distribution using IBUmedianState data frame
IBUmaxState$State <- factor(IBUmaxState$State, level = IBUmaxState$State[order(IBUmaxState$IBUmax)])
IBUmaxState %>% ggplot(aes(x=State, y=IBUmax, fill=State)) + geom_bar(stat = "identity") + ggtitle("Max IBU By State")+
# geom_text(aes(label=IBUmax), hjust=-0.5) +
# theme(axis.text.y = element_text(size = 6) ,legend.position = "none") +
# coord_flip()
# ggplotly()
geom_text(mapping = aes(label = sprintf("%0.2f", round(IBUmax,digits = 2)), size=3), check_overlap = TRUE) +
theme(axis.text = element_text(size =7) ,legend.position = "none") +
ylab("Max IBU By State") +
coord_flip()# plot ABV max distribution using ABVmaxState data frame
ABVmaxState$State <- factor(ABVmaxState$State, level = ABVmaxState$State[order(ABVmaxState$ABVmax)])
ABVmaxState %>% ggplot(aes(x=State, y=ABVmax, fill=State)) + geom_bar(stat = "identity") + ggtitle("ABV max By State")+
# geom_text(aes(label=ABVmax), hjust=-0.5) +
# theme(axis.text.y = element_text(size = 6)) +
# coord_flip()
# ggplotly()
geom_text(mapping = aes(label = sprintf("%0.2f", round(ABVmax,digits = 2)), size=3), check_overlap = TRUE) +
theme(axis.text = element_text(size =7) ,legend.position = "none") +
ylab("Max ABV By State") +
coord_flip()Summary
Through our analysis:…
6. Comment on the summary statistics and distribution of the ABV variable.
Statistics of ABV variable can be found with summary function, box plot, bar chart and scatter plots. Both ABV and IBU with ABV showing some outliers in the 12oz cans category and IBU exhibits some outliers also in 12 OZ IPA.
7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.
Scatter plots with linear model smooth function are provided, the correlation between IBU and ABV is 0.757878, with a p-value < 2e-16, Multiple R-squared is 0.5744, meaning that 57% changes in exploratory variable ABV can explain the changes in response variable IBU.
8. Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN clustering to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand.
A category variable IPA_Ale is created to split all the 2410 beers into three groups: Ale (1007), IPA (571) and Other (832), we only use the first two groups for analysis in this part. With this KNN classification model, we get an estimated accuracy of 0.7893 with k=5.
9. Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.
We identified an external data source from the Dept of HHS which shows the level of alcohol abuse by various age group. We hypothesized that there may be some correlation between the abundant availability of alcohol, as represented by number of breweries and the number of cases of alcohol abuse.
Using a map chart we plotted the continuous variables and sure enough there visual evidence that supports our hypothesis. We also conducted a correlation analysis, and we obtained a positive linear relationship